Abstract

This report demonstrates basic exploration of COVID-19 data, with the focus on computing and graphing data on relative timelines, in which the “day zero” is unique for each country.

Learning objectives:

After this tutorial, the reader should be able to:

  1. Compute cumulative counts
  2. Plot interactive trajectories with plotly
  3. Compute relative timelines

Goals: Plotting trajectories

In this tutorial we will have three goals. First, we would like to have a tool to plot multiple trajectories onto the same canvas and explore the differences between trajectories plotted on the same scale. Second, we would like to create a unique temporal context for each trajectory and express it with respect meaningful anchors, such as first detected case or first confirmed death. Thirdly, we would like to better understand the sequence of these temporal anchors to have a better understanding of how the pandemic unfolded.

Here are the graphs we will be creating in this tutorial:

Goal 1 - Absolute, Interactive

Goal 2 - Relative, faceted

Goal 3 - Temporal sequence

Environment

Non-technical readers are welcome to skip this section.

library(magrittr) #Pipes
library(ggplot2) #For graphing
library(dplyr) # for shorter function names. but still prefer dplyr:: stems
library(lubridate) # for working with dates
library(plotly) # interactive graphs
config <- config::get()
# to be applied to every graph we will make
ggplot2::theme_set(
  ggplot2::theme_bw(
  )+
    theme(
      strip.background = element_rect(fill="grey90", color = NA)
    )
)
# important dates we will refer to in analysis
date_of_exodus   <- lubridate::date("2020-01-13") # first case outside of China
date_of_pandemic <- lubridate::date("2020-03-11") # WHO declares pandemic
# to help us focus on a manageable set of countries for purposes of demonstration
oecd_countries <- c(
  "AUS", "AUT", "BEL", "CAN", "CZE", "DNK", "EST", "FIN", "FRA",
  "DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR",
  "LVA", "LTU", "MEX", "NLD", "NZL", "NOR", "POL", "PRT", "SVK",
  "SVN", "ESP", "SWE", "CHE", "TUR", "GBR", "USA", "RUS", "ZAF"
)
# adds neat styling to your knitr table
neat <- function(x, output_format = "html"){
  # knitr.table.format = output_format
  if(output_format == "pandoc"){
    x_t <- knitr::kable(x, format = "pandoc")
  }else{
    x_t <- x %>%
      # x %>%
      # knitr::kable() %>%
      knitr::kable(format=output_format) %>%
      kableExtra::kable_styling(
        bootstrap_options = c("striped", "hover", "condensed","responsive"),
        # bootstrap_options = c( "condensed"),
        full_width = F,
        position = "left"
      )
  }
  return(x_t)
}

Data

The data comes from European Centre for Disease Prevention and Control and can be downloaded here. I discuss the preparation of this data for analysis is a separate post.

# covid data
ds_covid <- readr::read_csv(config$path_input_covid)
ds_covid %>% glimpse()
Rows: 31,365
Columns: 8
$ date              <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 2020...
$ n_cases           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ n_deaths          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ country_code      <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"...
$ n_population_2018 <dbl> 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386,...
$ country_code2     <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF",...
$ country_label     <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan",...
$ continent_label   <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "...
# note that only `date`, `n_cases`, and `n_deatsh` change with time
# other variables have the same value within each country

A few things to notice about this data set:

  1. Only date , n_cases, and n_deaths change over time. All other variables have a single unique value for each country.

  2. For some countries, the observations are missing for certain dates, for example:

ds_covid %>% filter(country_code == "FIN") %>% filter(date > as_date("2020-02-26"))
# A tibble: 95 x 8
   date       n_cases n_deaths country_code n_population_2018 country_code2 country_label continent_label
   <date>       <dbl>    <dbl> <chr>                    <dbl> <chr>         <chr>         <chr>          
 1 2020-02-27       1        0 FIN                    5518050 FI            Finland       Europe         
 2 2020-02-28       0        0 FIN                    5518050 FI            Finland       Europe         
 3 2020-02-29       1        0 FIN                    5518050 FI            Finland       Europe         
 4 2020-03-01       0        0 FIN                    5518050 FI            Finland       Europe         
 5 2020-03-02       3        0 FIN                    5518050 FI            Finland       Europe         
 6 2020-03-03      NA       NA FIN                    5518050 FI            Finland       Europe         
 7 2020-03-04       1        0 FIN                    5518050 FI            Finland       Europe         
 8 2020-03-05      NA       NA FIN                    5518050 FI            Finland       Europe         
 9 2020-03-06       5        0 FIN                    5518050 FI            Finland       Europe         
10 2020-03-07       7        0 FIN                    5518050 FI            Finland       Europe         
# ... with 85 more rows

The data preparation step ensured that each country has the same number of rows, creating these missing cells. This is important for two reasons: A) we want to be able to differentiate between the absence of cases and the absence of reporting and B) missing dates will complicate comutation of relative timelines.

Data Tweaks

# to help us focus on a smaller subset of countries
oecd_countries <- c(
  "AUS", "AUT", "BEL", "CAN", "CZE", "DNK", "EST", "FIN", "FRA",
  "DEU", "GRC", "HUN", "ISL", "IRL", "ISR", "ITA", "JPN", "KOR",
  "LVA", "LTU", "MEX", "NLD", "NZL", "NOR", "POL", "PRT", "SVK",
  "SVN", "ESP", "SWE", "CHE", "TUR", "GBR", "USA", "RUS", "ZAF"
)
# to have a handy filter
ds_covid <- ds_covid %>%
  mutate(
    oecd = country_code %in% oecd_countries
  )

We focus on 36 country members of the Organization for Economic Co-operation and Development (OECD) because they report more nuanced data on their economic and social development, so we can have a richer pool of explanatory variables (see http://stats.oecd.org/ )

Goal 1

Computing cumulative (running) sum is easily accomplished with cumsum function paired with group_by, however, watch out for NA values: they will break the running sum, resulting in the NA for the rest of the column:

ds_covid <- ds_covid %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(n_cases) # would produce NA after en
    # n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
  ) %>%
  ungroup()
ds_covid %>%
  filter(country_code == "FIN") %>% filter(date > as_date("2020-02-26")) %>%
  select(date, country_code, n_cases, n_cases_cum)
# A tibble: 95 x 4
   date       country_code n_cases n_cases_cum
   <date>     <chr>          <dbl>       <dbl>
 1 2020-02-27 FIN                1           2
 2 2020-02-28 FIN                0           2
 3 2020-02-29 FIN                1           3
 4 2020-03-01 FIN                0           3
 5 2020-03-02 FIN                3           6
 6 2020-03-03 FIN               NA          NA
 7 2020-03-04 FIN                1          NA
 8 2020-03-05 FIN               NA          NA
 9 2020-03-06 FIN                5          NA
10 2020-03-07 FIN                7          NA
# ... with 85 more rows

To avoid this, we use convert NA to 0 on the fly with tidyr::replace_na() function:

ds_covid <- ds_covid %>%
  group_by(country_code) %>%
  mutate(
    # n_cases_cum = cumsum(n_cases) # would produce NA after en
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
  ) %>%
  ungroup()
ds_covid %>%
  filter(country_code == "FIN") %>% filter(date > as_date("2020-02-26")) %>%
  select(date, country_code, n_cases, n_cases_cum)
# A tibble: 95 x 4
   date       country_code n_cases n_cases_cum
   <date>     <chr>          <dbl>       <dbl>
 1 2020-02-27 FIN                1           2
 2 2020-02-28 FIN                0           2
 3 2020-02-29 FIN                1           3
 4 2020-03-01 FIN                0           3
 5 2020-03-02 FIN                3           6
 6 2020-03-03 FIN               NA           6
 7 2020-03-04 FIN                1           7
 8 2020-03-05 FIN               NA           7
 9 2020-03-06 FIN                5          12
10 2020-03-07 FIN                7          19
# ... with 85 more rows

This option is better than converting NAs to zero during the data preparation step, as this would mask the absence of reporting, overwriting it with a definitive value of 0 cases.

Now, with the cumulative case available, we can plot the trajectory in a interactive graph with plotly package:

library(plotly)
g1 <- ds_covid %>%
  filter(oecd) %>%
  highlight_key(~country_label )%>%
  plot_ly(
    x = ~ date, y = ~ log(n_cases_cum)
  ) %>%
  group_by(country_label) %>%
  add_lines()
g1 %>%
  highlight(
    on = "plotly_click"
    ,selectize = TRUE
    ,dynamic = TRUE
    ,persistent = TRUE
  )

Note that we add the highlight component with the plotly::highlight() function.

If you are familiar with ggplot2, the syntax of plotly should not be too confusing, however, it does take some time to get used to.

For more options and syntax guide, see https://plotly-r.com/client-side-linking.html. There is also a package for rendering ggplot2 into interactive graphs (ggplotly), which offers less flexibility in display design compared to creating graphs directly with plotly, but is much simpler to implement.

g1 <- ds_covid %>%
  filter(oecd) %>%
  ggplot(aes(x = date, y= log(n_cases_cum)))+
  geom_line(aes(group = country_code))

g1 <- plotly::ggplotly(g1) %>% plotly::highlight(on = "plotly_hover")
g1

Goal 2

It is often makes sense to compare the progression of epidemics across countries using a meaningful “time zero”, for example the day of the first confirmed case or the first confirmed death in the country. To help us carry out the computation, let us construct a fictional example that we can use to develop the script

# create reproducible example (reprex) to test out your function
d_reprex <- tibble::tribble(
  ~country_code, ~date, ~n_cases,
  "Alabnia", "2020-03-01", NA,
  "Alabnia", "2020-03-02", 0,
  "Alabnia", "2020-03-03", 1,
  "Alabnia", "2020-03-04", 0,
  "Alabnia", "2020-03-05", 3,
  "Botswana",   "2020-04-01", 0,
  "Botswana",   "2020-04-02", NA,
  "Botswana",   "2020-04-03", 2,
  "Botswana",   "2020-04-04", 3,
  "Botswana",   "2020-04-05", 0,
  "Chile",   "2020-05-01", 2,
  "Chile",   "2020-05-02", 0,
  "Chile",   "2020-05-03", 0,
  "Chile",   "2020-05-04", 3,
  "Chile",   "2020-05-05", 1,
 ) %>%
  mutate(
    date = lubridate::as_date(date)
  )
d_reprex  %>% neat()
country_code date n_cases
Alabnia 2020-03-01 NA
Alabnia 2020-03-02 0
Alabnia 2020-03-03 1
Alabnia 2020-03-04 0
Alabnia 2020-03-05 3
Botswana 2020-04-01 0
Botswana 2020-04-02 NA
Botswana 2020-04-03 2
Botswana 2020-04-04 3
Botswana 2020-04-05 0
Chile 2020-05-01 2
Chile 2020-05-02 0
Chile 2020-05-03 0
Chile 2020-05-04 3
Chile 2020-05-05 1

We already know how to compute cumulative counts

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
  )
d_reprex_timeline
# A tibble: 15 x 4
# Groups:   country_code [3]
   country_code date       n_cases n_cases_cum
   <chr>        <date>       <dbl>       <dbl>
 1 Alabnia      2020-03-01      NA           0
 2 Alabnia      2020-03-02       0           0
 3 Alabnia      2020-03-03       1           1
 4 Alabnia      2020-03-04       0           1
 5 Alabnia      2020-03-05       3           4
 6 Botswana     2020-04-01       0           0
 7 Botswana     2020-04-02      NA           0
 8 Botswana     2020-04-03       2           2
 9 Botswana     2020-04-04       3           5
10 Botswana     2020-04-05       0           5
11 Chile        2020-05-01       2           2
12 Chile        2020-05-02       0           2
13 Chile        2020-05-03       0           2
14 Chile        2020-05-04       3           5
15 Chile        2020-05-05       1           6

Now we will use a simple logical test to create a logical variable indicating when the running total exceeded the value of the chosen threshold.

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
  )
d_reprex_timeline
# A tibble: 15 x 5
# Groups:   country_code [3]
   country_code date       n_cases n_cases_cum onset_case
   <chr>        <date>       <dbl>       <dbl> <lgl>     
 1 Alabnia      2020-03-01      NA           0 FALSE     
 2 Alabnia      2020-03-02       0           0 FALSE     
 3 Alabnia      2020-03-03       1           1 TRUE      
 4 Alabnia      2020-03-04       0           1 TRUE      
 5 Alabnia      2020-03-05       3           4 TRUE      
 6 Botswana     2020-04-01       0           0 FALSE     
 7 Botswana     2020-04-02      NA           0 FALSE     
 8 Botswana     2020-04-03       2           2 TRUE      
 9 Botswana     2020-04-04       3           5 TRUE      
10 Botswana     2020-04-05       0           5 TRUE      
11 Chile        2020-05-01       2           2 TRUE      
12 Chile        2020-05-02       0           2 TRUE      
13 Chile        2020-05-03       0           2 TRUE      
14 Chile        2020-05-04       3           5 TRUE      
15 Chile        2020-05-05       1           6 TRUE      

It may make sense to use other operationalization the “onset” event.. For example, we can define it as “the date of the 10th case” or “3rd death” or “.1% of population infected”. Now with the column onset_case marking whether the running total is higher than a threshold, we can identify the first occurrence of TRUE :

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L
  )
d_reprex_timeline
# A tibble: 15 x 6
# Groups:   country_code [3]
   country_code date       n_cases n_cases_cum onset_case first_case
   <chr>        <date>       <dbl>       <dbl> <lgl>      <lgl>     
 1 Alabnia      2020-03-01      NA           0 FALSE      FALSE     
 2 Alabnia      2020-03-02       0           0 FALSE      FALSE     
 3 Alabnia      2020-03-03       1           1 TRUE       TRUE      
 4 Alabnia      2020-03-04       0           1 TRUE       FALSE     
 5 Alabnia      2020-03-05       3           4 TRUE       FALSE     
 6 Botswana     2020-04-01       0           0 FALSE      FALSE     
 7 Botswana     2020-04-02      NA           0 FALSE      FALSE     
 8 Botswana     2020-04-03       2           2 TRUE       TRUE      
 9 Botswana     2020-04-04       3           5 TRUE       FALSE     
10 Botswana     2020-04-05       0           5 TRUE       FALSE     
11 Chile        2020-05-01       2           2 TRUE       TRUE      
12 Chile        2020-05-02       0           2 TRUE       FALSE     
13 Chile        2020-05-03       0           2 TRUE       FALSE     
14 Chile        2020-05-04       3           5 TRUE       FALSE     
15 Chile        2020-05-05       1           6 TRUE       FALSE     

Notice, that we use the property of logical class: when used in mathematical expression, FALSE assumes the value of 0, while TRUE is interpreted as 1.

Now we can use this indicator to extract the date associated with this row:

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L
    ,date_of_1case1 = ifelse(first_case, date, NA) %>% lubridate::as_date()
  )
d_reprex_timeline
# A tibble: 15 x 7
# Groups:   country_code [3]
   country_code date       n_cases n_cases_cum onset_case first_case date_of_1case1
   <chr>        <date>       <dbl>       <dbl> <lgl>      <lgl>      <date>        
 1 Alabnia      2020-03-01      NA           0 FALSE      FALSE      NA            
 2 Alabnia      2020-03-02       0           0 FALSE      FALSE      NA            
 3 Alabnia      2020-03-03       1           1 TRUE       TRUE       2020-03-03    
 4 Alabnia      2020-03-04       0           1 TRUE       FALSE      NA            
 5 Alabnia      2020-03-05       3           4 TRUE       FALSE      NA            
 6 Botswana     2020-04-01       0           0 FALSE      FALSE      NA            
 7 Botswana     2020-04-02      NA           0 FALSE      FALSE      NA            
 8 Botswana     2020-04-03       2           2 TRUE       TRUE       2020-04-03    
 9 Botswana     2020-04-04       3           5 TRUE       FALSE      NA            
10 Botswana     2020-04-05       0           5 TRUE       FALSE      NA            
11 Chile        2020-05-01       2           2 TRUE       TRUE       2020-05-01    
12 Chile        2020-05-02       0           2 TRUE       FALSE      NA            
13 Chile        2020-05-03       0           2 TRUE       FALSE      NA            
14 Chile        2020-05-04       3           5 TRUE       FALSE      NA            
15 Chile        2020-05-05       1           6 TRUE       FALSE      NA            

There will be only one date associated with traspassing the threshold, so we populate the rest of the cells with it

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    ,onset_case = n_cases_cum > 0
    ,first_case = cumsum(onset_case) == 1L
    ,date_of_1case1 = ifelse(first_case, date, NA) %>% lubridate::as_date()
    ,date_of_1case2 = min(date_of_1case1, na.rm = T)
  )
d_reprex_timeline
# A tibble: 15 x 8
# Groups:   country_code [3]
   country_code date       n_cases n_cases_cum onset_case first_case date_of_1case1 date_of_1case2
   <chr>        <date>       <dbl>       <dbl> <lgl>      <lgl>      <date>         <date>        
 1 Alabnia      2020-03-01      NA           0 FALSE      FALSE      NA             2020-03-03    
 2 Alabnia      2020-03-02       0           0 FALSE      FALSE      NA             2020-03-03    
 3 Alabnia      2020-03-03       1           1 TRUE       TRUE       2020-03-03     2020-03-03    
 4 Alabnia      2020-03-04       0           1 TRUE       FALSE      NA             2020-03-03    
 5 Alabnia      2020-03-05       3           4 TRUE       FALSE      NA             2020-03-03    
 6 Botswana     2020-04-01       0           0 FALSE      FALSE      NA             2020-04-03    
 7 Botswana     2020-04-02      NA           0 FALSE      FALSE      NA             2020-04-03    
 8 Botswana     2020-04-03       2           2 TRUE       TRUE       2020-04-03     2020-04-03    
 9 Botswana     2020-04-04       3           5 TRUE       FALSE      NA             2020-04-03    
10 Botswana     2020-04-05       0           5 TRUE       FALSE      NA             2020-04-03    
11 Chile        2020-05-01       2           2 TRUE       TRUE       2020-05-01     2020-05-01    
12 Chile        2020-05-02       0           2 TRUE       FALSE      NA             2020-05-01    
13 Chile        2020-05-03       0           2 TRUE       FALSE      NA             2020-05-01    
14 Chile        2020-05-04       3           5 TRUE       FALSE      NA             2020-05-01    
15 Chile        2020-05-05       1           6 TRUE       FALSE      NA             2020-05-01    

This allows us for a very straightforward computation of the distance between any given date and the date of the “onset”, in this case the date of the first confirmed case:

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    #,onset_case = n_cases_cum > 0
    # ,first_case = cumsum(onset_case) == 1L
    # ,date_of_1case1 = ifelse(first_case, date, NA) %>% lubridate::as_date()
    # ,date_of_1case2 = min(date_of_1case1, na.rm = T)
    # alternatively, as a single sentence:
    ,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>%
      min(na.rm=T) %>%
      lubridate::as_date()
    # relative timeline
    ,days_since_1case = (date - date_of_1case) %>% as.integer()
    ) %>%
  ungroup() %>%
  select(-date_of_1case) # because it is easily derived
d_reprex_timeline %>% neat()
country_code date n_cases n_cases_cum days_since_1case
Alabnia 2020-03-01 NA 0 -2
Alabnia 2020-03-02 0 0 -1
Alabnia 2020-03-03 1 1 0
Alabnia 2020-03-04 0 1 1
Alabnia 2020-03-05 3 4 2
Botswana 2020-04-01 0 0 -2
Botswana 2020-04-02 NA 0 -1
Botswana 2020-04-03 2 2 0
Botswana 2020-04-04 3 5 1
Botswana 2020-04-05 0 5 2
Chile 2020-05-01 2 2 0
Chile 2020-05-02 0 2 1
Chile 2020-05-03 0 2 2
Chile 2020-05-04 3 5 3
Chile 2020-05-05 1 6 4

Finally, we can re-express these steps more succinctly, however, it might be advisible to leave these step in comments in case you need to retrace your steps or debug an error down the stream

d_reprex_timeline <- d_reprex %>%
  group_by(country_code) %>%
  mutate(
    n_cases_cum = cumsum(tidyr::replace_na(n_cases,0))
    #,onset_case = n_cases_cum > 0
    # ,first_case = cumsum(onset_case) == 1L
    # ,date_of_1case1 = ifelse(first_case, date, NA) %>% lubridate::as_date()
    # ,date_of_1case2 = min(date_of_1case1, na.rm = T)
    # alternatively, as a single sentence:
    ,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>%
      min(na.rm=T) %>%
      lubridate::as_date()
    # relative timeline
    ,days_since_1case = (date - date_of_1case) %>% as.integer()
    ) %>%
  ungroup() %>%
  select(-date_of_1case) # because it is easily derived
d_reprex_timeline %>% neat()
country_code date n_cases n_cases_cum days_since_1case
Alabnia 2020-03-01 NA 0 -2
Alabnia 2020-03-02 0 0 -1
Alabnia 2020-03-03 1 1 0
Alabnia 2020-03-04 0 1 1
Alabnia 2020-03-05 3 4 2
Botswana 2020-04-01 0 0 -2
Botswana 2020-04-02 NA 0 -1
Botswana 2020-04-03 2 2 0
Botswana 2020-04-04 3 5 1
Botswana 2020-04-05 0 5 2
Chile 2020-05-01 2 2 0
Chile 2020-05-02 0 2 1
Chile 2020-05-03 0 2 2
Chile 2020-05-04 3 5 3
Chile 2020-05-05 1 6 4

Compute relative timelines

ds_covid_timeline <- ds_covid %>%
  group_by(country_code) %>%
  mutate(
    # compute timeline of cumulative confirmed cases
    n_cases_cum  = cumsum(tidyr::replace_na(n_cases,0))
    ,date_of_1case = ifelse(cumsum(n_cases_cum > 0) == 1L, date, NA) %>%
      min(na.rm=T) %>%
      lubridate::as_date()
    ,days_since_1case = (date - date_of_1case) %>% as.integer()
    # compute timeine of cumulative deaths
    ,n_deaths_cum  = cumsum(tidyr::replace_na(n_deaths,0))
    ,date_of_1death = ifelse(cumsum(n_deaths_cum > 0) == 1L, date, NA) %>%
      min(na.rm=T) %>%
      lubridate::as_date()
    ,days_since_1death = (date - date_of_1death) %>% as.integer()
    # compute absolute timeline
    ,days_since_exodus   = as.integer(date - date_of_exodus) # first case outside of china
    ,days_since_pandemic = as.integer(date - date_of_pandemic) # WHO declares pandemic
    ,n_deaths_cum_per_1m = as.integer(n_deaths_cum/n_population_2018*1000000)
    ,n_cases_cum_per_1m  = as.integer(n_cases_cum/ n_population_2018*1000000)
  ) %>%
  ungroup() %>%
  select(-date_of_1case, -date_of_1death) %>%
  select(
    date, country_code, n_cases, n_deaths, n_cases_cum, n_deaths_cum,
    days_since_1case, days_since_1death, days_since_exodus, days_since_pandemic,
    dplyr::everything()
  )
ds_covid_timeline %>% glimpse()
Rows: 31,365
Columns: 17
$ date                <date> 2019-12-31, 2020-01-01, 2020-01-02, 2020-01-03, 2020-01-04, 2020-01-05, 2020-01-06, 20...
$ country_code        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AF...
$ n_cases             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_cases_cum         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_deaths_cum        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ days_since_1case    <int> -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -3...
$ days_since_1death   <int> -84, -83, -82, -81, -80, -79, -78, -77, -76, -75, -74, -73, -72, -71, -70, -69, -68, -6...
$ days_since_exodus   <int> -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1...
$ days_since_pandemic <int> -71, -70, -69, -68, -67, -66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -5...
$ n_population_2018   <dbl> 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 37172386, 3717238...
$ country_code2       <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF...
$ country_label       <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan...
$ continent_label     <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia",...
$ oecd                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
$ n_deaths_cum_per_1m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ n_cases_cum_per_1m  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Goal 2 graph

g1 <- d1 %>%
  ggplot(aes(
    x = days_since_exodus
    ,y = n_cases_cum
  ))+
  geom_line()+
  facet_wrap(~country_label, scale = "free", ncol = 6)+
  geom_point(
    data  = d1 %>% filter(days_since_1case == 1)
    ,size = 2, fill = "#1b9e77", color = "black", alpha = .5, shape = 21
  )+
  geom_point(
    data  = d1 %>% filter(days_since_1death == 1)
    ,size = 2, fill = "#d95f02", color = "black", alpha = .5, shape = 21
  )+
  geom_vline(xintercept = 58, linetype = "dotted",)+
  geom_vline(xintercept = 75, linetype = "dashed", alpha = .5)+
  geom_vline(xintercept = 100, linetype = "dashed", color = "red", alpha = .5)+
  scale_x_continuous(breaks = seq(0,100, 50))+
  labs(
    title = "Timeline of COVID-19: Cumulative Cases"
    , y = "Cumulative Cases (in thousands)", x = "Days since first case outside of China (Jan 13, 2020)"
    , caption = "(first dot) = 1st confirmed case, (second dot) = 1st confirmed death,
    (dotted line) = pandemic announced by WHO, (dashed lines) = 75 and 100th day since Exodus"
  )
cat("\n## Cases\n")

Cases

g1

# cat("\n## Cases per 1m\n")
# g1 + aes(y = n_cases_cum_per_1m)+labs(y = "Cumulative Cases per 1 mil (in thousands)",
#                                       title = "Timeline of COVID-19: Cumulative Cases per 1 million")
# cat("\n## Deaths\n")
# g1 + aes(y = n_deaths_cum)+labs(y = "Cumulative Deaths",
#                                 title = "Timeline of COVID-19: Cumulative Deaths")
# cat("\n## Deaths per 1m\n")
# g1 + aes(y = n_deaths_cum_per_1m)+labs(y = "Cumulative Deaths per 1 mil",
#                                        title = "Timeline of COVID-19: Cumulative Deaths per 1 million")

Goal 3

g3 <- ds_covid_timeline %>%
  filter(oecd) %>%
  filter(days_since_1case == 0) %>%
  mutate(
    country_label                = forcats::fct_reorder(country_label, days_since_exodus)
    ,days_to_1death_since_exodus = (-1*days_since_1death) + days_since_exodus
  ) %>%
  ggplot(aes(x = days_since_exodus, y = country_label))+
  geom_segment(aes(yend = country_label, xend = 0), linetype=NA, alpha = .8)+
  geom_segment(aes(yend = country_label, xend = days_to_1death_since_exodus, color = "red"))+
  geom_point(shape = 21, size =2, alpha = .6, fill = "#1b9e77")+
  geom_point(aes(x = days_to_1death_since_exodus), shape = 21, size =2, alpha = .6, fill = "#d95f02")+
  geom_text(aes(label = country_code2, x = days_to_1death_since_exodus), hjust = -1, size = 3, color = "grey60")+
  scale_x_continuous(breaks = seq(0,140, 20))+
  guides(color = F)+
  labs(
    title = "COVID Timeline: Days to 1st Case and 1st Death"
    ,x = "Days to first case since exodus (January 13)", y = NULL
    ,caption = "(green dot) = 1st confirmed case, (orange dot) = 1st confirmed death"
  )
g3

session information

For the sake of documentation and reproducibility, the current report was rendered in the following environment. Click the line below to expand.

Environment

- Session info -------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2020-05-31                  

- Packages -----------------------------------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
 backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
 callr         3.4.3   2020-03-28 [1] CRAN (R 3.6.3)
 cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.3)
 codetools     0.2-16  2018-12-24 [2] CRAN (R 3.6.3)
 colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.1)
 config        0.3     2018-03-27 [1] CRAN (R 3.6.3)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
 crosstalk   * 1.0.0   2016-12-21 [1] CRAN (R 3.6.2)
 data.table    1.12.8  2019-12-09 [1] CRAN (R 3.6.2)
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
 devtools      2.3.0   2020-04-10 [1] CRAN (R 3.6.3)
 digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.3)
 dplyr       * 0.8.5   2020-03-07 [1] CRAN (R 3.6.3)
 ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
 evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.2)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
 farver        2.0.3   2020-01-16 [1] CRAN (R 3.6.2)
 fastmap       1.0.1   2019-10-08 [1] CRAN (R 3.6.2)
 forcats       0.4.0   2019-02-17 [1] CRAN (R 3.6.2)
 fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
 generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.2)
 ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.6.2)
 glue          1.4.0   2020-04-03 [1] CRAN (R 3.6.3)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.2)
 hms           0.5.3   2020-01-08 [1] CRAN (R 3.6.2)
 htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
 htmlwidgets   1.5.1   2019-10-08 [1] CRAN (R 3.6.2)
 httpuv        1.5.2   2019-09-11 [1] CRAN (R 3.6.2)
 httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.2)
 jsonlite      1.6.1   2020-02-02 [1] CRAN (R 3.6.2)
 knitr       * 1.28    2020-02-06 [1] CRAN (R 3.6.2)
 later         1.0.0   2019-10-04 [1] CRAN (R 3.6.2)
 lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.6.2)
 lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.3)
 lubridate   * 1.7.8   2020-04-06 [1] CRAN (R 3.6.3)
 magrittr    * 1.5     2014-11-22 [1] CRAN (R 3.6.2)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
 mime          0.9     2020-02-04 [1] CRAN (R 3.6.2)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.2)
 pillar        1.4.3   2019-12-20 [1] CRAN (R 3.6.2)
 pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.2)
 pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
 plotly      * 4.9.2   2020-02-12 [1] CRAN (R 3.6.2)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
 processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.2)
 promises      1.1.0   2019-10-04 [1] CRAN (R 3.6.2)
 ps            1.3.2   2020-02-13 [1] CRAN (R 3.6.2)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 3.6.3)
 R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
 Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
 readr         1.3.1   2018-12-21 [1] CRAN (R 3.6.2)
 remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.2)
 rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.3)
 rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.2)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
 scales        1.1.0   2019-11-18 [1] CRAN (R 3.6.2)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
 shiny         1.4.0   2019-10-10 [1] CRAN (R 3.6.2)
 stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.2)
 stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
 testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.3)
 tibble        3.0.1   2020-04-20 [1] CRAN (R 3.6.3)
 tidyr         1.0.2   2020-01-24 [1] CRAN (R 3.6.2)
 tidyselect    1.0.0   2020-01-27 [1] CRAN (R 3.6.2)
 usethis       1.6.0   2020-04-09 [1] CRAN (R 3.6.3)
 utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.2)
 vctrs         0.2.4   2020-03-10 [1] CRAN (R 3.6.3)
 viridisLite   0.3.0   2018-02-01 [1] CRAN (R 3.6.2)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
 xfun          0.12    2020-01-13 [1] CRAN (R 3.6.2)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 3.6.2)
 yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.2)

[1] C:/Users/an499583/Documents/R/win-library/3.6
[2] C:/Users/an499583/Documents/R/R-3.6.3/library